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Abstract 

Regression, weighting and related approaches to estimating a population mean from a 
sample with nonrandom missing data often rely on the assumption that conditional on 
covariates, observed samples can be treated as random. Standard methods using this as¬ 
sumption generally will fail to yield consistent estimators when covariates are measured 
with error. We review approaches to consistent estimation of a population mean of an in¬ 
completely observed variable using error-prone covariates, noting difficulties with applying 
these methods. We consider the application of Simulation-Extrapolation (SIMEX) as a 
simple and effective alternative. We provide technical conditions under which SIMEX will 
lead to a consistent estimator of a population mean and argue why it may function well 
in common settings. We use a simulation study to demonstrate its potential for removing 
nearly all of the bias in regression, weighting and doubly robust estimators for a population 
mean while maintaining precision competitive with what would be obtained without mea¬ 
surement error. We also discuss and evaluate options for estimating the standard error of 
the SIMEX mean estimator. Finally, we present an empirical example of estimating middle 
school effects on student achievement. 
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1. Introduction 

A common problem in applied research is the estimation of a population mean from a sample 
with nonrandom missing data. Applications include survey nonresponse in which variables 
of interest are missing for some units, and observational studies of causal effects in which 
each unit’s potential outcome under one of multiple possible treatment assignments is ob¬ 
served and the remaining potential outcomes are missing (Bang and Robins, 2005; Kang and 
Schafer, 2007; Lunceford and Davidian, 2004; Robins et al., 1994; Rubin, 1974; Scharfstein 
et al., 1999). A common way of constructing consistent estimators for a population mean 
in these settings is to assume that missing data are “ignorable” conditional on observed 
covariates. That is, although the observed outcome data may be a nonrandom sample from 
the full population, it is assumed that they are random samples from the subpopulations of 
units sharing the same values of the covariates. Estimators using this assumption include 
regression estimators, inverse probability-of-response weighted (IPW) estimators, matching 
or stratification estimators, and “doubly-robust” (DR) estimators that combine approaches 
to yield consistent estimators if either the regression or probability of response model are 
correct (Kang and Schafer, 2007; Rosenbaum and Rubin, 1983; Stuart, 2010). 

The success of these methods requires that the covariates necessary for missing data to 
be ignorable contain no measurement error (ME). When both outcomes and the probability 
of response or treatment assignment depend on latent quantities that are observed only 
through error-prone surrogates, modeling with the surrogates while ignoring ME can result 
in bias in weighting, matching, regression and related estimators (D’Agostino and Rubin, 
2000; Kuroki and Pearl, 2014; Lockwood and McCaffrey, 2015a; McCaffrey et al., 2013; 
Pearl, 2010; Regier et al., 2014; Steiner et al., 2011; Raykov, 2012; Yi et al., 2012). 

ME in key covariates is commonplace. For example, scores from standardized achieve¬ 
ment tests commonly are used to adjust for non-equivalent student groups in observational 
studies in educational research (Battauz and Bellio, 2011; Lockwood and McCaffrey, 2014), 
including the estimation of individual school and teacher “value-added” effects for account¬ 
ability purposes (Braun, 2005; Harris, 2011; McCaffrey et al., 2004a). Test scores are error- 
prone measures of latent achievement (Lord, 1980). It is clear that future achievement 
outcomes, and perhaps other outcomes of interest, depend on these latent achievement at¬ 
tributes, and thus observational treatment groups need to be balanced on these attributes in 
order to estimate treatment effects unbiasedly. Balancing the error-prone scores is generally 
insufficient to balance the latent attributes when treatment assignment depends in part on 
the latent quantities (Yi et al., 2012). For example, in a study of a computer-based algebra 
tutor (Pane et al., 2014), students were administered a study pre-test designed to assess 
algebra knowledge, but student selection into the intervention condition was based in part 
on decisions by school personnel that were made prior to the administration of the pre-test. 
The observed pre-test scores can at best proxy for the underlying attributes that differenti¬ 
ate intervention from control students. Therefore adjusting the groups to be equivalent on 
the latent constructs measured by the pre-test, rather on than the observed pre-test scores, 
would be necessary to argue for unbiased estimation of the intervention effect. 

ME occurs in other contexts such as survey responses, which are often used as covari¬ 
ates for observational studies. For example, McCaffrey et al. (2010) used student survey 
responses to questions about attitudes, dispositions, and alcohol, cigarette and marijuana 
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use at grade 7 to control for differences among students who did or did not have heavy 
marijuana use at grades 9 and 10 in a study of the effect of marijuana use on high school 
graduation. However, there is a substantial literature on the errors in student survey data 
including self-reported drug use (Freier et al., 1991; Martino et al., 2009). 

Unfortunately, methods to correct regression, IPW, DR and other estimators of popula¬ 
tion means for covariate ME tend to be complicated. For example, McCaffrey et ah (2013) 
provide sufficient conditions for weights based on error-prone covariates to provide con¬ 
sistent estimates when outcomes and the probability of response or treatment assignment 
depend on latent covariates. They also develop a computational approach for calculating 
weights that satisfy the required conditions. It involves two steps: 1) using nonlinear ME 
modeling with the observed data to estimate the parameters of the true propensity score 
function of the latent covariates; and 2) using Monte Carlo integration to solve an integral 
equation for a function of the observed covariates that is unbiased for the corresponding 
inverse of the propensity score function evaluated at the latent covariates. The first step 
can be difficult. For the second step, the class of problems for which unbiased estimating 
functions can be computed in closed form is limited (Stefanski, 1989) and approximating the 
required functions also can be difficult. Approaches similar to that developed by McCaffrey 
et al. (2013) can be used to correct DR estimators for covariate ME but require even more 
assumptions, additional nonlinear ME modeling, and additional searching for unbiased es¬ 
timating functions (McCaffrey and Lockwood, 2014). Corrections for recently proposed 
estimation methods, such as the covariate balancing propensity score method (Imai and 
Ratkovic, 2014) or propensity score weighting combined with exact balancing of selected 
covariates (Haberman, 1984; Hainmueller, 2012; Kim, 2010), have yet to be developed, but 
are likely to face similar or even greater challenges. 

In this article we argue that Simulation-Extrapolation (SIMEX) may be a more practical 
and accessible method for estimating a population mean with incomplete data when covari¬ 
ates are measured with error. SIMEX is a general method for ME correction that provides 
approximately consistent estimators in models where other ME correction methods would 
be intractable (Carroll et al., 2006; Cook and Stefanski, 1994). SIMEX is often applied to 
estimating the parameters of generalized linear models with error-prone covariates (Cook 
and Stefanski, 1994; Fung and Krewski, 1999). In the causal inference literature, Valeri 
et al. (2014) use SIMEX to conduct mediation analysis in generalized linear models when 
a continuous mediator is measured with error. We are unaware of its application to the 
estimation of population means with nonresponse and covariate ME. 

SIMEX has several potential advantages over other approaches for using error prone co¬ 
variates to account for nonrandom missingness when estimating a population mean. First, 
it requires no specialized ME modeling or solutions to integral equations. It can be im¬ 
plemented with standard statistical routines that do not correct for ME, making it widely 
practical and accessible. This is true even in complex settings with multiple error-prone 
and error-free covariates, or with complicated estimators that combine multiple stages of 
weighting and regression, where alternative approaches for ME correction may be unclear. 
Second, it requires no distributional assumptions about latent covariates and their rela¬ 
tionships to other covariates, which alternatives often use. Finally, as argued by Cook and 
Stefanski (1994), implementing SIMEX provides an automatic way of studying the sensi- 
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tivity of inferences to covariate ME. These features, combined with the relative complexity 
of alternative approaches to the problem, make SIMEX potentially attractive in practice. 

In Section 2 we establish notation and assumptions about the observed and latent data. 
In Section 3 we review regression, IPW and DR methods for consistent estimation of a pop¬ 
ulation mean with incomplete data and covariate ME. In Section 4 we review the SIMEX 
method, provide technical conditions under which it will lead to a consistent estimator of 
a population mean with incomplete data and covariate ME, suggest a method for assessing 
whether these conditions are likely to hold in a given setting, and discuss standard error es¬ 
timation. In Section 5 we demonstrate the successful performance of SIMEX for regression, 
IPW and DR estimators in a simulation study. We then present an empirical example of 
the estimation of middle school effects on student achievement outcomes in Section 6, and 
conclude with a discussion and suggestions for future research in Section 7. 

2. Notation and Model Assumptions 

We let (Y, R, Z, A, W) be random quantities with a proper joint distribution on a probability 
space. We assume we are dealing with independent and identically distributed (IID) samples 
of size n from this joint distribution corresponding to individual units in a population, and 
subscript units by i when necessary. The dichotomous response indicator R is observed 
for all units. The outcome of interest is Y, which is observed for units when R = 1 but 
is missing when R = 0. We assume that the goal is to estimate the population mean 
/r* := E[Y] and assume this is finite. 

This framework covers several common applications. In survey sampling, Y is an out¬ 
come of interest and R indicates whether or not this outcome is observed for a sampled 
unit. In longitudinal settings, R = 1 can indicate units for which an outcome Y from a 
given timepoint is observed, and R = 0 can indicate units who have dropped out by that 
timepoint. In causal inference with observational data, there are two outcomes of interest: 
Yo, the potential outcome of each unit had it received control (T = 0), and Y \, the potential 
outcome of each unit had it received treatment (T = 1). The population average treatment 
effect is defined as E\Y{\ — E[Yo\, so the population means of both potential outcomes must 
be estimated. However Y\ is observed only for treatment units and Yo is observed only for 
control units, so there are missing data for both outcomes. When estimating E[Yi], Y refers 
to Y\ and R = T, and when estimating E[Yq], Y refers to Yo and R = (1 — T). 

The problem in all of these cases is that E[Y \ R = 1] does not generally equal /i* because 
the observed values of Y are not necessarily a random sample from the population. To get 
around this problem, we assume the covariates (A, Z) are sufficient for “strong ignorability” 
of response for inference about Y (Imbens, 2000; Rosenbaum and Rubin, 1983): 

Y is independent of R given (A, Z) and p(R = 1 | A, Z) > 0 for all (A, Z). (1) 

Following Kang and Schafer (2007), we denote the propensity score p(R = 1 | A, Z) by 
7r(A, Z), and we denote the conditional expectation function E[Y | A, Z\ by m(A, Z). 

Although we assume that (A, Z) is required for strong ignorability, we assume that the 
observed covariates for each unit are (W, Z) where W is the error-prone measure of A, and 
A is not observed for any unit. This is in contrast to a case where A might be partially 
missing for some units (Rosenbaum and Rubin, 1984; D’Agostino and Rubin, 2000) because 
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X is never observed. Observed covariates that can be treated as error-free are denoted by 
Z. Each of W, X and Z can be vectors but we do not explicitly use vector notation. 

We assume that W = X + U for U ~ 1V(0, a 2 ), where a 2 is either known, or is estimated 
sufficiently well from auxiliary data (e.g. replicate measurements or a validation dataset) 
to be treated as known. Normality of the ME is typically assumed in SIMEX, though the 
method is purported to be robust to deviations from normality (Carroll et ah, 2006). We 
further assume that U is independent of ( Y, R ) given ( X , Z), so that ME is non-differential 
(Carroll et al., 2006) and W satisfies “strong surrogacy” (Lockwood and McCaffrey, 2015a). 
This implies that W would be irrelevant to the analysis if X were observed. This assumption 
requires scrutiny in every context, and Lockwood and McCaffrey (2015a) discuss scenarios 
in which it may be more or less likely to hold. 

2.1 Simulated Example 

We now introduce a simulated example that we use throughout the article. The scenario 
has three scalar covariates X, Z\ and Z<i- The distribution of X is a mixture of normal 
distributions. It is skewed, multi-modal and scaled to have mean zero and variance one. 
The variable Z\ is correlated 0.3 with X and normally distributed conditional on X. Its 
marginal mean and variance are also zero and one. The variable Z 2 is dichotomous with 
mean 0.5 and independent of (X, Z\). 

We generate R using the model in the simulation example in McCaffrey et al. (2013). 
That example specified R as Bernoulli with 

p(R = 1 | X = x, Z\ = z\, Z 2 = Z 2 ) = G(0.5 + 1.2x + 0.5zi — 1.0^2 + OXXZ 2 ) (2) 

where G is the CDF of a Cauchy random variable. This model imposes clear differences 
between samples with R = 0 and R = 1 on the distributions of all covariates: compared to 
the sample with R = 0, the means for the sample with R = 1 are almost 1 SD unit higher 
on X , 0.57 SD units higher on Z\ and about 0.16 lower on Z^- 

We consider two outcome variables. The first is Y = a+bX + cZ\+dZ 2 + e, e ~ 1V(0 ,t 2 ) 
and independent of everything, where ( a,b,c,d,r 2 ) are set so that E[Y] = 0, Var[F] = 1, 
the covariates explain 80% of the variance in Y, the mean for R = 1 cases is about 0.43, and 
the mean for R = 0 cases is about —0.42. Thus R = 1 and R = 0 cases differ substantially 
on the distribution of Y. We refer to this as the “linear outcome” because its conditional 
mean function is linear in the covariates. We also consider Y* = (0.812 + 1.255T)7(0.812 + 
1.255y > 0) where / is the indicator function. We chose 0.812 and 1.255 so that both E[Y*] 
and Var[y*] are approximately 1. Among R = 0 cases, Y* has mean of about 0.58 and 
probability about 0.39 of equaling zero, compared to a mean of about 1.42 and probability 
about 0.13 of equaling zero among R = 1 cases, and so is clearly differentiated between 
the two groups. We refer to Y* as the “tobit outcome” because it follows a tobit model 
conditional on the covariates. 

Finally, we generate the measurement errors U as independent of everything and dis¬ 
tributed as 77(0, cr 2 = 0.176). The resulting reliability of W as a measure of X is 0.85, 
computed as Var[X]/(Var[X] + a 2 ) (Crocker and Algina, 1986). This can be interpreted 
as the correlation of two hypothetical measurements Wi and W 2 of the same X where 
W\ and W 2 have independent measurement errors. A reliability of 0.85 is consistent with 
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those of standardized test scores from state K-12 assessments. The joint distributions of 
(Y, R, Z i, Z 2 , X, W) and of (Y*, R, Z\. Z 2 , X, W) both meet the strong ignorability and sur¬ 
rogacy conditions specified previously in this section. The code for the R environment (R 
Development Core Team, 2015) that we use to simulate data from these distributions is 
provided in the Appendix. The simulated data are summarized in Table 1 for reference. 


Variable Description 

Y Linear outcome with E[Y] = 0, Var[V] = 1, and E[Y\R = 1] « 0.43 

Y* Tobit outcome with E[Y*] = 1, Var[V*] = 1, and E[Y*\R = 1] ~ 1.42 

R Response indicator; R = 1 indicates outcome is observed 

Z\ , Z 2 Error-free covariates 

X Unobserved covariate with E[X] = 0 and Var[A] = 1 

W Observed proxy for X with normal ME and reliability 0.85 

Table 1: Summary of random variables from the simulated example. 

3. Review of Approaches for Consistent Estimation 

Kang & Schafer (2007) summarize regression, IPW and DR approaches for estimating 
in the standard (i.e., no ME) case where ( X , Z) is observed for all units. In this section we 
review these approaches and discuss how they can be expressed as solutions to estimating 
equations. We use this correspondence to discuss analogs to these approaches when (IT, Z) 
is instead observed for all units and the other assumptions of the previous section hold. 

3.1 Regression 

If Xi were observed and if the conditional mean function rn(X, Z) = E[Y \ X, Z] were 
known, then estimating the population mean /r* would be straightforward. By the law of 
iterated expectation, /i* = E[E[Y \ X, Z]] so that (1/n) YH=i m (A*, Zi) would consistently 
estimate the mean. In practice, m(X, Z) is unknown and must be estimated using only 
the observed values of Y. This is possible with the strong ignorability assumption in (1), 
which implies that E[Y \ X. Z\ = E[Y \ X, Z, R = 1], which can be estimated because Y is 
observed for all R = 1 units. In practice it is often assumed that m{X,Z) = m(X, Z, 5*) 
for a parametric model fh(X, Z, 5), where 6 is a K -dimensional vector of model parameters 
with true value 5*. Kang & Schafer (2007) define (1/n) YHi =1 fh( Xi , Zi,5) as the regression 
estimator of /r*, where 5 is an estimate of d*. This estimator is consistent for / 1 * when 
m(X, Z, 6) is correctly specified, 6 is consistent for 5*, strong ignorability holds, and general 
regularity conditions hold (see Tsiatis (2007) for an example of these regularity conditions). 

The two stages of estimating <5* with 6 and then using the m(X, Z , 5) to estimate /r* can 
be expressed in one stage as the solution to an estimating equation 

n 

(l/ n ) V’O'ii Ei, Zi,Xi,9) = 0. (3) 

1=1 

Here i^{Y, R, Z,X,9) is a K+l dimensional function with one function corresponding to each 
of the parameters in 9 = ([J.,5')' with true value 0* = (/r*,#))'. The key idea of estimating 
equations is that if ip(Y, R, Z, X, 9) is specified such that 0* is the unique value of 9 with 
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the property that E[ip(Y, R, Z, X, 0*)] = 0, and other regularity conditions hold, then the 
solution to Equation 3 consistently estimates 0* (Stefanski and Boos, 2002). We assume the 
first component ip(Y, R , Z, X, 9)\ corresponds to n so that R, Z, X , 9)\ = rh(X, Z, b)—fi. 
This satisfies the key requirement because E\ip{Y, R, Z, X, 0*)i] = E[rh(X, Z,5*)\ — (jl* = 
E\E[Y\X, Z]J — //*. The remaining components of ip depend on the function fh and the 
method used to estimate its parameters <5. Later we consider the case where fh is linear 
and b is estimated with least squares, but nonlinear, generalized linear and other estimation 
methods also fit this form (see, e.g. Lunceford and Davidian 2004 and references therein). 

When (Y, R, Z,W) is observed instead of (Y,R,Z,X), the estimating function must 
be modified from ip(Y, R, Z, X, 9) to tp(Y, R, Z, W, 9) such that E[tp(Y, R, Z,W,9*)] = 0. 
Carroll et al. (2006) provide equations for the terms associated with <5. For fi we note 
that if fh is replaced by B(W,Z,6) such that E[B(W, Z, b) \ X. Z] = fh(X, Z, b), then 
tp(Y, R , Z, W,9)\ = B(W, Z,6) — n meets the sufficient equality condition. This is because 
B(W, Z, <5) has the property that E[B(W, Z, (5*) | X,Z\ = m(X, Z) so that E[B(W, Z, 5*)] — 
/i* = 0. This idea naturally extends the approach of McCaffrey et al. (2013) which is 
a special case of the general “corrected score function” or “unbiased estimating function” 
approach to ME correction (Carroll et al., 2006). 

In the case of a linear model, a plug-in approach would work: if m(X, Z) = <5*o + 
(5* \X + b^Z then B(W, Z, S) = 5q + b\W + b-zZ satisfies the requirements of an unbiased 
estimating function. The parameters <5* can be estimated consistently from the observed 
data by b obtained using standard approaches for correcting a linear model for covariate ME 
(Fuller, 2006), and then the consistent regression estimator for /r* is (1/n) Ya= i Z,;, b). 

With normally distributed ME, a closed form for B(W, Z, 6) is also possible when m(X, Z) 
is a polynomial using the results of Stefanski (1989). Another case where a closed form 
exists with normally distributed ME is the exponential function: if m(X, Z) = exp(<5*o + 
(5*i X + 5*2Z), then results using the normal moment generating function indicate that 
B(W,Z,b) = exp(<5 0 + biW + b 2 Z - b\a 2 /2) satisfies E{B(W,Z,b*) \ X, Z] = m(X,Z). 
Outside of these cases, finding B(W,Z,b) typically requires Monte Carlo approximations 
such as those suggested by McCaffrey et al. (2013) or the general Monte Carlo methods 
using complex variables described by Novick and Stefanski (2002) and Carroll et al. (2006). 
These methods are not trivial to implement and yield approximate solutions whose validity 
must be checked carefully in any given setting. 


3.2 Inverse Probability-of-Response Weighting 


IPW estimators also rely on assumptions about the ignorability of response given observed 
covariates. However, these estimators avoid making explicit assumptions about the condi¬ 
tional mean of the outcome given those covariates. In fact, one of the purported advantages 
of weighting is the ability to make adjustments for imbalance in observed covariates be¬ 
tween respondents and nonrespondents without reference to the outcomes, and possibly 
before they are even observed (Rubin, 2001). If the probability of response were known 


and Xi was observed, then the IPW estimator for //* would equal 


SIU Rj-K- 1 (x i ,z i )Y i 


which 


£"=i Rur-Hx^Zi) 

is closely related to the Horvitz-Thompson estimator (Horvitz and Thompson, 1952). The 
intuition is that the estimator is a weighted mean of the observed Y t , where the weights 
are chosen to reweight the (Xj. Zj) distribution among these cases to match the population 
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distribution of ( X, Z). This, combined with the strong ignorability assumption, allows the 
weighted mean of the observed Y* to consistently estimate the population mean /x*. Typi¬ 
cally, the response probabilities are unknown, so a functional form for the propensity score 
is selected and the parameters of this model are then estimated from observed data. 

Again, the two stages of estimating the parameters of the propensity score model and 
then using the estimated propensity scores to calculate the weighted mean can be written 
in one stage as the solution to an estimating equation. We let a be the parameters of the 
working model rf(X,Z,a) and let q(X,Z,a) = 1/7 f(X,Z,a). The IPW estimator solves 
(1/n) £/=i -Rj Z, X, 9) = 0, where now 9 = (/x, c/)' with true value = (/x*,a*y, 
and where ip(Y, R, Z, X, 9)\ = Rq(X, Z, a)[Y — /x]. Provided n(X, Z, a*) = tt(X, Z) and a 
consistently estimates a*, the IPW estimator is consistent and asymptotically normal. 

When X % is unobserved, even if we know the true propensity score function tt(X, Z) 
or can estimate it consistently from the observed data, we still do not know where to 
evaluate it. Dealing with this problem again uses the logic of unbiased estimating func¬ 
tions. McCaffrey et al. (2013) show that if q(X, Z, a) is replaced by A(W, Z, a) where 
E[A(W, Z, a) | X, Z\ = q(X, Z, a), then the solution to the resulting estimating equation is 
consistent if 7 r(X, Z, a*) = tt(X,Z) and a consistently estimates ct*. The idea is that the 
weighted estimator using (W, Z) is constructed so that each case is given a weight that is un¬ 
biased for its true, unobserved weight determined by ( X , Z). A closed form for A(W, Z, a) 
exists for logistic propensity score functions with normal ME using the normal moment 
generating function result noted previously. Outside of this case, numerical methods for 
function approximation are generally required. 


3.3 Doubly Robust 

Doubly robust estimators combine a model fh for the conditional mean of the outcome and 
a model n for the probability of response to provide estimators for /x* that are consistent if 
either model, but possibly not both, is correct (Bang and Robins, 2005). This can insure 
against model misspecification. Kang and Schafer (2007) review several ways to combine 
the two models to create a doubly robust estimator. Here we focus on one. When X{ is 
observed, the “bias corrected” DR estimator for /x* noted by Kang and Schafer (2007) is 


(1/n) ^m(Xj, Zi,S) + 

i =1 


£?=i Ri[Yi - fh(Xi, Z u S)]q(Xi, Z u a) 
Ei=i Riq{X u Zi,a) 


(4) 


where, as previously defined, rh(X, Z, 6) is a working model for the conditional mean 
and q(X, Z, a) is a weighting function equal to the reciprocal of the working model for 
the propensity score function 1 This estimator is the solution to an estimating equation 
(1/n) ip(Yi, Ri , Zi,Xi, 9) = 0, where now 9 = (/x, 6', a')' with true value 0* = (/x*, 5', a')', 
and where ij)(Y, R, Z, X , 9) 1 = Rq(X, Z, ot){Y — m(X, Z, d)) + (rh(X, Z, 6) — /x). This esti¬ 
mator is consistent and asymptotically normal provided either q(X, Z, a*) = tt(X,Z)~ 1 
or rh(X, Z,5*) = m(X,Z). The intuition for this estimator is that when m is correctly 
specified, the term on the right in Equation 4 converges to zero and the term on the left 

1. Kang and Schafer (2007) present (4) as a variation of + (1/n) Ri[Y — 

rh(Xi, Zi, 8)\q(Xi, Zi,a). As with the IPW estimator, normalizing the weights tends to improve the 
precision of the estimator and is commonly used in practice. 
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converges to //* regardless of whether 7 f is correctly specified, whereas when if is correct, 
the term on the left converges to a possibly incorrect value jl while the term on the right 
corrects this bias by converging to /i* — p,. 

When X is measured with error by W, satisfying surrogacy, then again this approach can 
be extended using unbiased estimating functions. If functions A(W, Z, a), B(W,Z,8), and 
C(W, Z, a, <5) satisfy E[A(W, Z, a) \ X.Z\ = q(X, Z, a), E[B(W, Z, 8) \ X,Z\ = rh(X, Z, 8) 
and E[C(W, Z, a, 8) j X, Z] = fh(X, Z, 8)q(X , Z, a), then 

E[R(A(W, Z, a)Y - C(W, Z, a , <5))] + E[B(W, Z , 5)] = E[Y\ 

if either q(X, Z,a*) = 7 r(X, Z) _1 or rh(X, Z,8*) = m(X,Z). The estimating function 
ip(Y, R, Z, X, 0) 1 for /x* is replaced by tp(y, R, Z, W, 0)i = R(A(W, Z, a)Y - C(W , Z, a, 5)) + 
(B(W,Z,8) — fi) which has mean zero for // = //* if either q(X, Z, a*) = ir(X, Z)~ l or 
rh(X, Z, 8*) = m(X,Z). Using this result to obtain an estimator requires estimating the 
parameters of two potentially nonlinear models of (X, Z) from the observed data and then 
solving three integral equations. If X given Z, and W given (X, Z), are both normally 
distributed, then closed-form expressions exist for all three functions. However, solutions 
do not exist for some values of the parameters, and the practical performance of these 
estimators has not been studied. Similar ideas can be used to extend the DR estimator 
of Bang and Robins (2005) that includes the reciprocal of the estimated propensity score 
as a covariate in rh(X, Z) to the case of ME. Extending the DR approach noted by Kang 
and Schafer (2007) that uses weighted regression with inverse propensity score weights is 
possible, but may be impractical because of the potentially large number of functions of 
(X, Z) that must be estimated and for which corresponding unbiased estimating functions 
must be found. 

3.4 Summary 

The general method of unbiased estimating functions can be used to correct certain regres¬ 
sion, IPW and DR estimators for covariate ME. However the methods generally require 
specialized ME modeling and the solution or approximation of unbiased estimating func¬ 
tions that must be tailored to the particular estimator being used. These steps can be 
difficult. For example, for the simulated data scenario described in Section 2 which includes 
a complicated marginal distribution for X, a Cauchy link for the propensity score function 
that includes an interaction term involving X, and a nonlinear conditional mean function 
(for the tobit outcome), closed form solutions for either estimating the model parameters 
or determining unbiased estimating functions are unlikely to be available for any of the 
common mean estimators. In the remaining sections we consider SIMEX as an alternative 
approach that is more practical and accessible. 

4. SIMEX for Estimation of a Population Mean 

In this section we 1) review the SIMEX method; 2) provide technical conditions under 
which it will lead to a consistent estimator of a population mean with incomplete data and 
covariate ME; 3) provide an example in which these conditions can be verified; 4) suggest 
a method for assessing whether these conditions are likely to hold in other settings; and 5) 
discuss standard error estimation. 
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4.1 Review of SIMEX 

To describe SIMEX, we introduce notation similar to that used by Devanarayan and Stefan- 
ski (2002) and Lockwood and McCaffrey (2015b). We consider hypothetical data in which 
X is measured with ME with variance (1 + A) a 2 for A > — 1, where a 2 is the variance of 
the ME in the observed data. The resulting vector of data (Y, R, Z, X, VF( 1 + a)) has a joint 
distribution that we denote by P( i+a)- At A = 0, (1 + A)cr 2 = <r 2 , and so (Y. R, Z, X, W^) 
has the same distribution as the observed data. When A > 0 the hypothetical data have 
larger ME than the observed data. At the other extreme, W^y corresponding to A = — 1, 
denotes hypothetical data with no ME so that W(yy = X, and P^ 0 ) equals the distribution 
of (Y, R, Z, X). We assume that (Y), Ri, Zi, AQ, Wy+xy) for i = 1 ,...,n are independent 
and identically distributed with distribution P(i+\) and we use {!), Ri, Zi, Wm*} to denote 
the observed data (Y\,R\, Z\,W(iy), (I2, P2, Z-), 1E( 1 ) 2 ),..., (Y n , R n , Z n , W^y n ) across units 
i = 1 ,... ,n, where it is understood that Yj is observed only when /?, = !. When there is 
no risk for confusion, we drop the subscripting by i to simplify notation. 

As discussed in the previous section, if there were no ME, the estimator of 9 n of 0* 
is the solution to (1/n) JA ip(Yi, Ri, Zi, AQ, 9) = 0. We assume for the remainder that 
ip(Y, R, Z, X, 9) is correctly specified and that other standard regularity conditions hold so 
that 9 n 9*. We refer to the “naive” estimator of 0* as the solution to the equation 
(l/ n ) Yli i’O'i, Ri, Zi, Wny, 9) = 0 which ignores ME by plugging Wyy into the equation 
in place of AQ. We use 0m n to denote this estimator. For large samples, 9yy n converges 
to the solution to E[ijj(Y, R, Z, Wm, 0)\ = 0, which we call 0(iy Following this notational 
convention, we denote the solution to E[i/)(Y, R, Z, Wm+a)) $)] = 0 by 0 (i+a) f° r ^ > —1- We 
denote liniA-,-1 0(i+a) by O^y Under technical assumptions that we discuss later, 0( O ) = 

If this holds, then the logic of SIMEX is easy to state: if we had samples from P(i+a) f° r 
various values of A > 0, which can be generated by adding extra ME to the observed data, 
we could estimate $(i+a) an d develop a model for how $(i+a) varies as a function of A. 
Evaluating this model at A = — 1 would then give us an estimate of 0*. 

The SIMEX algorithm proceeds as follows. First a sequence of values Ai,...,Al is 
established, ranging from 0 to an upper bound commonly taken as 2 (Cook and Stefanski, 
1994), with L = 20 typically providing a sufficiently fine grid. Then for each l = 1 ,... ,L 
synthetic covariates y b = W(i )i+U* X yj b are generated, where the simulated b ~ 

IID N(0, Xga 2 ). The simulation step is repeated for b = 1 ,B, where B is as large as 
desired to reduce Monte Carlo error to an acceptably low level. The synthetic covariates thus 
have larger ME variance of (1 + A^cr 2 compared to cr 2 for the Wmj themselves. In fact, due 
to the assumptions of surrogacy and normal ME, {!), Ri, Zi,Xi, VU* 1+A ^ & } are independent 
and identically distributed according to P(i+a £ ) for each b. This is the key property of the 
simulated data from SIMEX: the synthetic data have the same distribution as hypothetical 
data with ME variance (1 + Xe)a 2 rather than cr 2 . Therefore the naive estimator 0(i+A £ ),6,n> 
equal to the solution to (1 /n)Yli' l l’(Yi, Ri, Zi,W* 1+Xi y b ) = 0, is a consistent estimator of 
0 ( i +A a. This is true for each b = 1,..., B, and in practice unnecessary noise in this estimator 
is reduced by averaging it over the b = 1 ,... ,B sets of simulated measurement errors. We 
denote this average by 

B 

^(l+A e),B,n 0-/B) ^ ^ (1+A £ ),fe,n- 

6=1 
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This step is performed for Ai,..., Al, which leads to the sequence of synthetic estimators 
0{l+\l),B,n ■■■1 0(1+A L ),B,n- 

The hnal step of SIMEX fits a parametric function Q{ A, 7 ) to 0( 1+Al ) g n ,..., g n 

to estimate 7 n and then extrapolates the prediction from that model back to A = —1 to 
estimate the target by Q{— l,7n)- This extrapolated value serves as the final estimator. The 
extrapolation step is the “Achilles Heel” of SIMEX because the true functional relationship 
between A and $(i+a) is typically unknown and therefore the extrapolation, estimated from 
the observed data, will be only approximate (Cook and Stefanski, 1994). 

Figure 1 demonstrates a hypothetical SIMEX projection. The naive estimator 0(i) jn 
using observed data is located at A = 0. The averages fb 1+A ^ g n of the naive estimators 
across B Monte Carlo replications using synthetic data with additional ME are located at 
each selected A^ > 0. A curve is fitted to these data points and then extrapolated back to 
A = — 1 to obtain the SIMEX estimator. The goal is to use the extrapolation to reduce or 
remove bias in the naive estimator as an estimator of #*. 

The SIMEX algorithm repeatedly uses naive estimators. It avoids the complexities of 
finding unbiased estimating equations of Section 3. The standard software and models that 
would be used if there were no ME are all that are required. The trade-off is that the 
extrapolation function must be modeled and may not be exact. 

4.2 Conditions for Consistent Estimation 

The general theory of estimating equations can be used to establish conditions under which 
SIMEX will yield a consistent estimator of a population mean. General arguments sup¬ 
porting the consistency and asymptotic normality of the SIMEX estimator for cases where 
estimators can be expressed as solutions to estimating equations, such as the estimators of 
the mean described in Section 3, are provided by Cook and Stefanski (1994), Stefanski and 
Cook (1995) and Carroll et al. (1996). In this section we establish technical conditions un¬ 
der which SIMEX will lead to a consistent estimator of a population mean with incomplete 
data and covariate ME, along the way providing some general results for SIMEX that build 
on previous work by Cook and Stefanski (1994) and Carroll et al. (1996). 

Carroll et al. (1996) show that under general regularity conditions, like those required 
for the solution of Equation 3 to be consistent, linin^o ^( 1+A ) b n = #(i+a)- This is true 

for each value of A. We let +x)Bn denote the first element of #( 1+ A)Sn’ the estimate 
of n, and /U(i+a) equal its limit. For extrapolation step, we choose a parametric model 
for /i( 1+A ), which we call Q( A, 7 ). For example, the function may be quadratic so that 
Q( A, 7 ) = 70 + 71 A + 72 A 2 . We can extrapolate all the parameters, but our interest is in /i*, 
so we focus only on the extrapolation of V‘(i+x)Bn- We estimate the parameters of Q by 
fitting the model to the i = 1,..., L simulation-based estimates, Bn' f° r the chosen 

values. The resulting parameter estimates can be written as: 7 n = b n> 

where 07 are vectors that depend only on the values of A^ and do not depend on the 
observed data or the sample size. The SIMEX estimator of /r* is Q(— l, 7 n ). As n —> 00 , 
In 7 = Ya =1 i+x e ) an d, by Slutsky’s theorem (Serfling, 1980), -4 £(- 1 , 7 ). 

We have not assumed that Q is correctly specified, so consequently we cannot assume that 
Q(— 1 , 7 ) equals /i( 0 ), the limit of /R(i+a) as A approaches —1. However, if Q is correctly 
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Example of SIMEX Projection 



X 


Figure 1: Example of SIMEX projection. The naive estimator of using observed data is 
located at A = 0. Averages of naive estimators across B Monte Carlo replications 
using synthetic data are located at each selected A > 0. The SIMEX estimator is 
obtained by fitting a curve to the naive estimators and projecting that function, 
located at A = — 1. 


specified so that = G{ A, 7 +) for some 7 *, then = 7 *> provided L is 

greater than the dimension of 7 . Consequently, if we specify the extrapolation function 
correctly, then G(— l,7n) £?(— 1,7*) and Q[— 1,7*) = /i( 0 )- Now all we need for consis¬ 
tency is /Z(o) = /r*; that is, must converge to /i* as A approaches —1. Recall that 

/i(i + A) is the first element of $(i+a)- Thus, if limA^_i ^(i+a) = that is, if the solution to 
E[ip(Y, R , Z , W(x + a), 9)} = 0 converges to the solution of E[ip(Y, R, Z, X, 0)] = 0, and we cor¬ 
rectly specify the extrapolation function, then Q(— 1 , 7 ) equals /i* and the SIMEX estimator 
is consistent. We provide sufficient conditions for the solution to E[ip(Y, R , Z, VF( 1 + a), 0)] = 0 
to converge to the solution of E[ip(Y, R, Z, X, 9)] = 0 in the Appendix. These conditions 
apply to SIMEX estimation in any parametric model where parameter estimates can be 
expressed as solutions to estimating equations, not only the mean estimation problem. 
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4.3 Example: SIMEX for the Regression Estimator of the Mean 

As an example to provide intuition about how the solution to E[i(:(Y. R, Z, VF( 1+ a), 0)] = 0 
can converge to the solution of E[iJ)(Y, R , Z, X, 0)] = 0, we show it holds for the case of the 
regression estimator for a mean with a linear model with only one covariate X measured by 
W = X + U. We assume Y = <5*o + S*\X + e, where e has mean zero and is independent of 
X and R. We assume the linear model coefficients are estimated by least squares and the 
mean for the population is predicted using those estimates. Then ijj has three components: 

/ fi-8 0 - <5iX 

il}{Y,R,X,0)=\ (Y-So- SiX)R 
\ X(Y -6 0 - 5\X)R 

and 0 = (/i, So, 5 1 ). This gives that E[^(Y, R, W( 1+ a), 0)] equals 


( m-5 0 -Si£[W ( 1+a) ] \ 

[ (E[Y \R = 1]-5o- 6 1 E[W {1+x) \ R = l})p(R = l) . (5) 

V (E[W {1+x) Y \R = 1}- 5 0 E[W (1+x) \R = l]~ 5 ! E[wf 1+X) \ R = 1 ])p(R = 1) J 

Details given in the Appendix show that the solution to E[ip(Y, R, W( 1+A ), 0)] = 0 has 


<5*iu 2 


/i (1+A) - + <5*i(l - v2 + {1 + X)a2 )E[X \R-1} + u2 + {1 + X)a2 E[X} 


where u 2 = E[X 2 j R = 1] — E[X \ R = l] 2 . It is clear that, as A approaches —1, /i(i+ A ) will 
converge to ^o+^iEfA], which would be the value from the solution to E[i/j(Y, R, X, 0)] = 0 
and which is the population mean under the assumed model. 


4.4 Monte Carlo Verification of Conditions 

The linear regression example is atypical in that E[ijj(Y, R, W( 1+A ), 0)] and its unique zero 
can be computed in closed form. In general these operations will be intractable. Thus it 
will not be feasible to directly verify, as we did above, that the required limiting behavior 
as A —► — 1 holds. Similarly, the sufficient conditions in the Appendix often cannot be 
verified. However, if and thus -P(i+a) are known, it is possible to check by simulation 
that the required conditions hold for any estimation method (i.e. any given set of estimating 
equations). Suppose that for select values of A > — 1, including —1, we can draw samples 
from P( 1+A ). By the large sample results in the previous section, for very large n, the 

solution 0(i+a) to n _1 ^(Y), Rj, Zj, W(i + A)i) = 0 is essentially equal to the solution 

0(i + A) to E[i/j(Y, R, Z, W (1+A) )] = 0. Thus, we can apply our estimation method to the very 
large simulated data set for each value of A to obtain 0 (i+a) an d trace these as a function of 
A to verify convergence as A —> — 1, and to explore the shape of the extrapolation function. 
Because the data are simulated, data for —1 < A < 0 can be generated. 

We used this procedure to verify that the required convergence holds for various mean 
estimators in the simulated scenario described in Section 2. We generated a sample with 
n = 10 7 observations. We generated W(i + a) for 300 equally-spaced A values between —1 and 
2 inclusive. Figure 2 traces the solution to E[ip(Y*, R, Z, IF(i + a)> 0)] = 0 as a function of A 
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Figure 2: Extrapolation functions for five estimators for E[Y*]: 1) the regression estimator; 

2) the IPW estimator; 3) the bias-corrected DR estimator with correct propensity 
score function and correct regression function; 4) the bias-corrected DR estimator 
with correct propensity score function and incorrect regression function; and 5) 
the bias-corrected DR estimator with incorrect propensity score function and 
correct regression function. 


for five estimators for E[Y*], the mean of the tobit outcome: 1) the regression estimator; 
2) the IPW estimator; 3) the bias-corrected DR estimator with correct propensity score 
function and correct regression function; 4) the bias-corrected DR estimator with correct 
propensity score function and incorrect regression function, where the specified regression 
function included only W ; and 5) the bias-corrected DR estimator with incorrect propensity 
score function and correct regression function, where the specified propensity score function 
included only W, but used the correct Cauchy link function 2 . The figure clearly shows 
that, although the extrapolation functions for the different mean estimators have slightly 

2. For the estimators involving the mean function for Y *, parameters of the tobit model were estimated 
using the tobit() function in the AER package (Kleiber and Zeileis, 2008) for the R environment, and 
m(X, Z) was computed using methods appropriate for the tobit model. 
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different curvatures, all converge to the actual population mean of Y* as A —> —1. Similar 
results hold for the same set of five estimators applied to the linear outcome Y (Figure 8 
in the Appendix) as well as for Y* and Y where the Cauchy link for the propensity score 
model in Equation 2 was replaced by a logistic link (Figures 9 and 10 in the Appendix). 

The large simulated dataset can also be used to assess the adequacy of different SIMEX 
extrapolation functions for any given estimator. For example, Figure 3 plots the true extrap¬ 
olation function for the bias-corrected DR estimator with incorrect propensity score function 
and correct regression function. It also includes quadratic and quartic approximations fit 
to the values for A > 0. Both fitted approximations fit the true extrapolation essentially 
perfectly for A > 0 but the quadratic fit diverges by a small amount as A approaches — 1 
and does not fully correct for the bias. This conservative behavior of the quadratic extrap¬ 
olation function is a known general result for SIMEX (Carroll et ah, 2006). Alternatively 
the quartic extrapolation diverges only trivially from the correct limiting value. The results 
are similar for the other estimators and settings, except that the divergences are smaller for 
the regression estimator than the others. 

The obvious limitation to this approach in practice is that P^ is unknown; if it were 
known then E[Y] could be computed directly. However, the method could be used in 
applications with an approximation to informed by the observed data. For example, 
rh(X, Z) and/or tt(X, Z ) could be specified, perhaps using models and parameters estimated 
with the data. Values for Z could be drawn from the empirical distribution, and values 
for X could be generated according to some hypothesized distribution conditional on Z. 
Next, Y and R could be generated using the working estimates of rh(X,Z) and if(X,Z), 
and finally, IT( 1 + a) could be generated using the ME distribution evaluated at different 
values of A. Sensitivity analyses could be done to explore alternative values for the models, 
parameters, and the distribution of X. These analyses could provide confidence that the 
required limiting behavior of the estimating equation solutions holds in models similar to 
those generating the observed data, and may inform the choice of the extrapolation function. 

4.5 Standard Error Estimation for the SIMEX Estimator 

The SIMEX estimator is a complicated function of the observed and simulated data. Thus 
some work is required to estimate standard errors. Carroll et al. (2006) suggest three 
general methods for SIMEX standard error estimation. The first is a resampling method 
such as the jackknife or bootstrap, in which the entire SIMEX procedure is replicated with 
resampled datasets. The second uses SIMEX projection itself to estimate the standard error. 
Suppose there is a typical way of estimating the sampling covariance matrix V(#( 1+ ^),6,n) 
of 0(i+A f ),&,n) f° r example, using a sandwich variance estimator. In addition, let s? 1+A ^ Bn 

equal the sample covariance matrix of {6(i+\ t )fi,n}b=i- Then Carroll et al. (2006) argue that 
projecting 

B 

(1 /B)XF(«i 1+ W ,n) (6) 

6=1 

to A = — 1 approximates the sampling variance for the final SIMEX estimator. The third 
method is to express the SIMEX estimator as the solution to a system of estimating equa¬ 
tions that include both the model parameters and the parameters of the extrapolation 
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Figure 3: Extrapolation function, and quadratic and quartic approximations, for the bias- 
corrected DR estimator with incorrect propensity score and correct regression 
functions. 


function, and to use asymptotic standard errors available from general estimating equation 
theory. Details of this approach are provided in Appendix B.4 of Carroll et al. (2006). 

Each approach has advantages and disadvantages. The resampling method is straight¬ 
forward to implement, albeit computationally burdensome, because the SIMEX procedure 
itself can be computationally intensive and each bootstrap or jackknife replication requires 
replicating the full SIMEX procedure. The projection method requires no resampling, but 
V must be available, an extrapolation function must be chosen, and the final standard error 
approximation is valid only for large n and small ME. The full estimating equation approach 
is asymptotically correct, but it generally will be tedious to construct and check all of the 
required terms. In Section 5 we consider the performance of the bootstrap and projection 
methods in a simulation study, and in Section 6 we use the bootstrap in the case study of 
middle school effects on student achievement. 
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5. Simulation Study of Estimator Performance 

The development to this point has not considered how well SIMEX mean estimators perform 
in finite samples. In this section we evaluate the performance of SIMEX estimators, and 
estimated standard errors, again in the context of the simulated scenario from Section 2. 

5.1 Performance of SIMEX Estimators 

We conducted four simulations using four different data sample sizes of n = 500,1000, 5000 
and 25000. For each sample size we generated M = 500 independent datasets, and for each 
such dataset we implemented 20 estimators for E\Y], the mean of the linear outcome. These 
were determined by the five classes of estimators used in the previous section (regression, 
IPW, and three different versions of bias-corrected DR), and each was implemented in four 
different ways: 1) We applied the estimator to (Y, R, Z\, Z 2 , X). We call this the “ideal” 
approach. It would not be feasible in practice and we include it to calibrate the performance 
of the feasible estimators that use IE(i). 2) We applied the estimator to (Y, R, Z\, Z 2 , Wm) 
without any correction for ME. We call this the “naive” approach. 3) We applied the 
estimator to (Y, R, Z\, Z 2 ,Wn\) using SIMEX with the quadratic extrapolation function, 
which we label “SIMEX(2)”; and 4) we applied the estimator to (Y, R, Z\, Z 2 , TY^)) using 
SIMEX with the quartic extrapolation function, which we label “SIMEX(4)”. Each of the 
SIMEX estimators was implemented using B = 500 generations of the synthetic data and 
a A grid consisting of 20 equally-spaced points from 0 to 2. 

We used the M independent replications of the simulation to calculate the bias, standard 
error (SE) and root mean squared error (RMSE) of each of these 20 estimators for E[Y] = 0 
and for each sample size n. These results are summarized in Figure 4. For each estimation 
class, the figure plots the bias, standard error, and RMSE. In each panel of the figure, the 
values for the ideal, naive and SIMEX estimators are plotted by n. The scale of the vertical 
axis is 100 times the scale of the outcome variable, so that, for example, a value of “6” on 
the vertical axis corresponds to 0.06 outcome units. Because the outcomes were constructed 
to have population standard deviation 1, the values on the vertical axis can be thought of 
as percentages of a standard deviation unit of the outcome. The combination of M — 500, 
B = 500 and a A grid consisting of 20 points led to Monte Carlo error in our estimates that 
was sufficiently small to not affect our substantive conclusions. 

The ideal estimator is effectively unbiased for all estimation classes and has the smallest 
RMSE for every estimator class and sample size. Alternatively, the naive estimator is always 
biased and has the largest RMSE for every estimator class and sample size. Both types 
of SIMEX estimators perform well under all conditions. SIMEX(2) removes most of the 
bias for all estimator classes. It is also relatively precise; its RMSE dominates the naive 
estimator and is not much worse than the ideal estimator for all classes and sample sizes. 
SIMEX(4) removes virtually all of the bias, but the extra flexibility of the extrapolation 
function adds a little variance and thus it does not improve on SIMEX(2) with respect to 
RMSE until the sample size is very large. 

We repeated the simulation for the tobit outcome Y* and obtained similarly successful 
performance of SIMEX. The results are summarized in Figure 11 in the Appendix and are 
extremely similar to those presented in Figure 4. SIMEX is again competitive with the 
ideal in terms of RMSE for all estimator classes and sample sizes. 
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Figure 4: Performance of SIMEX estimators for mean E[Y] of the linear outcome. Estima¬ 
tors include regression, IPW, bias corrected DR (BCDR) with correct mean and 
propensity score models (1), correct propensity score and incorrect mean models 
(2), correct mean and incorrect propensity score models (3). Within each panel 
the magenta line is for the “ideal” estimator, the red line is for the “naive” es¬ 
timator, the blue line is for “SIMEX(2)” with the quadratic extrapolation, and 
the green line is for “SIMEX(4)” with the quartic extrapolation. The scale of the 
vertical axis corresponds to percentages of a standard deviation unit of Y. 
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In addition to these 20 estimators assessed in Figures 4 and 11, we also evaluated the 
performance of the DR estimator of Kang and Schafer (2007) that uses weighted regression 
modeling with inverse propensity score weights. We did this using the same incorrect 
regression and incorrect propensity score function used for the bias corrected DR estimator. 
The performance of the ideal, naive and SIMEX estimators for this DR class were virtually 
identical to those of the bias corrected form for both the linear and tobit outcomes. 

Finally, we also considered a replication of the simulation reported in Figures 4 and 11 
where the reliability of was 0.70 rather than 0.85. These are reported in Figures 12 and 
13 in the Appendix. In both cases, the bias in the naive estimators was much larger and both 
SIMEX(2) and SIMEX(4) still dominated the naive estimators in terms of RMSE for all 
estimator classes and sample sizes. However unlike the case with reliability 0.85, SIMEX(4) 
had lower RMSE than SIMEX(2) for all estimator classes when the sample size was 1,000 
or more. This is because the larger ME variance leads to more extreme nonlinearity in the 
true extrapolation function, so the bias reduction afforded by SIMEX(4) is relatively more 
important. The conclusion that a more flexible extrapolation function may be beneficial 
when the ME variance is larger probably generalizes to many other applications of SIMEX. 

5.2 Performance of Standard Error Estimators 

In this section we consider the performance of the bootstrap and projection methods for 
feasible standard error estimation. Figure 5 presents results for the linear outcome with 
n = 500 and reliability 0.85. The estimator classes are organized into the same five blocks 
as in Figure 4, and the ideal, naive and two SIMEX estimators are labeled on the left in the 
same color scheme as used in Figure 4. For each of the 20 estimators, Figure 5 provides the 
following. The black filled circle is the standard deviation of the true sampling distribution 
of each estimator, estimated using the M = 500 Monte Carlo simulations described in the 
previous section. This value is used as the benchmark for the feasible estimators because 
it is correct up to Monte Carlo error from the 500 simulated datasets. To evaluate the 
performance of the bootstrap standard error estimator, we selected a random sample of 10 
of the 500 simulated datasets, and for each of the samples, we then computed the bootstrap 
standard deviation of each of the mean estimators using 100 bootstrap samples. These 
values are plotted with the gray open circles, and their means across the 10 datasets are 
plotted with the gray triangle. To evaluate the performance of the projection standard 
error estimator, we used estimating equation theory to derive V for the regression, IPW 
and bias-corrected DR estimators. Details for these computations, which are more broadly 
applicable to estimating standard errors of these population mean estimators in the case of 
no measurement error, are provided in the Appendix. We used these values to estimate the 
standard error of the ideal and naive estimators for each of M* = 200 random datasets, and 
also used these values to compute the projection standard error estimators for the SIMEX 
estimators. These values are plotted with the open dark yellow circles in Figure 5, and their 
means across the 200 datasets are given by the orange squares. 

The bootstrap standard errors appear to function quite well. They appear to be essen¬ 
tially unbiased for the true standard errors for all estimator classes. The standard error 
estimators using V appear to be reasonable for the ideal and naive estimators, but slightly 
biased downward when used in the projection calculation for the SIMEX estimators, par- 
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Figure 5: Summary of estimated standard errors for different estimators of the mean of the 
linear outcome, with n = 500 and reliability of W(i) equal to 0.85. Full description 
in the text. 


ticularly for quartic SIMEX. This may be due to the fact that the projection calculation 
is generally valid only when the sample size is large and the ME variance is small. Plots 
similar to Figure 5 but for other scenarios including n = 1000, the tobit outcome, and 
reliability of 0.70 are given in Figures 14 to 20 in the Appendix. Due to computational 
limitations, we considered only n = 500 and 1000. These alternative scenarios replicate the 
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basic conclusions from the analyses presented in Figure 5. The bootstrap standard errors 
function well across scenarios, while the projection standard errors appear to be negatively 
biased, particularly for the quartic SIMEX estimator. These results suggest that the boot¬ 
strap is probably the safest approach to standard error estimation in practice, thus we use 
it in our empirical example in Section 6. 

6. Example: Estimating Middle School Effects 

This section summarizes an example in which we use SIMEX to estimate population average 
treatment effects on student math achievement for middle schools in a large suburban school 
district. The data include achievement test scores and demographic characteristics for 6,552 
grade 6 students nested in 24 middle schools. The covariates Z% for each student i include 
indicator variables for whether the student is black, Hispanic, male, eligible for special ed¬ 
ucation, eligible for gifted instruction, and eligible for free or reduced price lunches (FRL). 
The observed grade 5 math and language test scores (W^math) ff/jang) are assumed to be 
error-prone measures of latent grade 5 math and language achievement (Xj m a t h, -W iang ). 
The observed test scores are on a standardized scale where the district mean is zero and 
the district standard deviation is 1, so that score points are in student-level standard de¬ 
viation units. The number of grade 6 students in each middle school ranges from 142 to 
371 (mean 273) and there is substantial variation across schools in the distributions of 
(Zj, Wj ima th, W^iang). For example, the school-level averages of Hamath range from about 
—0.42 to about 0.70, so that schools range in average prior math achievement by more than 
1 student-level standard deviation unit of observed scores. Similarly, the school percentages 
of FRL-eligible students range from 6% to 85%. 

The outcome of interest is the grade 6 math achievement test score, also standardized 
to have mean zero and standard deviation 1 in the district. Consistent with the potential 
outcomes framework for causal effects, we assume that each student i has a potential grade 
6 math achievement outcome Y lyS for each school s = 1,..., 24, indicating what the grade 6 
test score would have been for student i had he or she attended school s 3 . 

We define the causal effect of school s for student i as — (1/24) Ylt=i ^i,tl i- e the 
difference between the potential outcome under school s and the average of the potential 
outcomes across all schools. We assume that for each s, Y/ s are IID samples from the 
population of students and we denote E\Yi yS ] by fj, s , the mean grade 6 test score that would 
be observed if all students in the population attended school s. Using the definition of the 
individual causal effects, the population average effect of school s is A s = fj, s — JL where JL is 
the equally-weighted mean of /z s across the 24 schools. Estimating {A s } requires estimating 
the 24 population means fi s in the presence of nonrandom missing data because Y t s are 
observed only for the corresponding subpopulations of students who attended each school. 
That is, because each student attended only one school, F) jS is observed only for the school 
s that student i attended, and l/,i is missing for the 23 schools where t ^ s. 


3. Implicit in our use of potential outcomes is the SUTVA or no-interference assumption that students’ 
outcomes do not depend on other students assigned to their school (Rubin, 1980). The plausibility of 
this assumption can be questioned. But regardless of whether or not it holds, failure to properly control 
for selection effects due to ME in the covariates would contribute to bias in the estimated school effects. 
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Grade 6 is the first year of middle school in this district. The grade 5 test scores 
(Wj )m ath, Wj.iang) were measured while the students were still in elementary school. Because 
middle school selection decisions are unlikely to be based on the observed test scores, it is 
reasonable to assume that observed differences across middle schools in grade 5 test scores 
reflect differences in latent achievement (^i,math, ^G,iang)- It is also reasonable to assume 
that the relationship between the grade 6 test scores and (R% mat h, Hsiang; Zi) is driven 
by the relationship between the grade 6 test scores and (Xj jmat h, ^G,iang) Zi). Thus, given 
the available covariates, it is probably more reasonable to assume strong ignorability holds 
conditional on (^G jma th, ^*,iangi Zi) than to assume that it holds conditional on the observed 
covariates (W^ mat h, Wj,i a ng, Zi), making the methods developed in this article appropriate. 

A challenge with applying SIMEX in this context is that test score ME is heteroskedastic 
with a complicated error structure. The observed test scores (Wj iina th, W?;.i ang ) are estimates 
of latent abilities (-Xi, ma th)-^i.iang) obtained from item response theory (IRT) models (Lord, 
1980; van der Linden and Hambleton, 1997). Such estimates have the property that the ME 
variance of mat h) for example, varies as a function of X, mat h according to a known func¬ 
tion <7math (A*,math) called the squared “conditional standard error of measurement (CSEM)” 
function (Lord, 1980). The CSEM function for a given test is determined by the number 
of items and their psychometric characteristics. These functions are available in our data. 
Using the methods of Lockwood and McCaffrey (2014), we estimate that the average relia¬ 
bilities of the grade 5 math and language test scores in this population are 0.87 and 0.84, 
respectively. These are consistent with the values considered in our simulation studies. 

The difficulty with using the CSEM function in SIMEX is that the ME variance for 
any individual student’s score is unknown because it depends on the latent achievement. 
It is thus unclear what ME variance to use for each test score during the simulation phase 
of SIMEX. Lockwood and McCaffrey (2015b) evaluate different plug-in estimators for the 
unknown variances 5'math(^i,math) and 5iang(^*,iang) for their effectiveness in SIMEX ap¬ 
plications. Their analyses suggest that treating estimates of E[g mat h(Xj jmat h)|IDi )ma th] and 
E[g i a ng (X,; i a ng )|m. l an g] as the known variance for each test score results in SIMEX estima¬ 
tors with reasonable properties. They discuss how to estimate these quantities from the 
observed data, and we used those methods here to compute ME variances of math and of lauo , 
to use for each student’s grade 5 test scores during SIMEX. 

To estimate the effects, we used quadratic SIMEX with the bias-corrected DR estimator 
to estimate each of the 24 means fj, s and then used these to estimate the effects {A s }. 
For each school we specify the mean as linear in Xj ^ z , Zj), which is reasonable 

given the generally high linear correlations between test scores from the same students across 
years. We specify the propensity score model using a linear predictor in (W. mat h, Ayi ang . Zi) 
and the Cauchy link function. The SIMEX estimates for {// s } were obtained using B = 500 
generations of the synthetic data and a A grid consisting of 20 equally-spaced points from 
0 to 2 for each school. Standard errors were computed using 100 bootstrap samples of the 
students and applying quadratic SIMEX to estimate {// s } for each sample. The bootstrap 
distributions of the estimated effects {A s } were unimodal and approximately symmetric, 
so that 95% confidence intervals using ±1.96 bootstrap standard errors are suitable. 

Figure 6 presents the estimated {A s }, where schools are sorted by their estimated effects. 
It also gives 95% bootstrap confidence intervals. The effects for most schools are statisti¬ 
cally significantly different from zero and range from approximately —0.3 to 0.3 standard 
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Figure 6: 


School effects estimated by quadratic SIMEX, with 95% bootstrap confidence 
intervals given by gray bars. 


deviation units of the grade 6 math score. A method of moments estimate of the variance 
among schools effects (Morris, 1983) indicates that schools account for about 3% of the total 
variance in the grade 6 math scores. This is consistent with other research on the magnitude 
of between-school variation in achivement growth (e.g., Kane and Staiger, 2002). 

Because the truth is unknown, we have to use indirect evidence to assess whether SIMEX 
improved the estimates of {A s } relative to the naive estimates. One of the consequences 
of failing to correct for covariate ME in estimates of group effects is that the resulting bias 
in the estimates tends to be positively correlated with group means of the latent covariates 
(Fuller, 2006) - that is, estimates for schools with high means of (X mat h, Xi ang ) are too 
high and those for schools with low means are too low. Therefore, if SIMEX is effective at 
reducing bias in the estimated school effects, it should be the case that SIMEX tends to 
decrease the naive estimates for schools with high average prior achievement and increase 
the naive estimates for schools with low average prior achievement. To examine this, we 
follow Lockwood and McCaffrey (2014) by using our longitudinal data to compute the 
school averages of the grade 5 test scores from the prior year cohort of grade 6 students. 
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This is an entirely different group of students than those used to compute the school effects. 
Thus any relationship of the differences between the naive and SIMEX estimates based on 
the estimation cohort of students, and the school averages of the incoming test scores for 
the prior year cohort, indicates systematic differences, rather than any spurious correlation 
resulting from computing estimates for a group of students and then relating the estimates 
to characteristics of those same students. 

Figure 7 plots the differences between quadratic SIMEX estimated school effects and 



Figure 7: Differences between quadratic SIMEX estimated school effects and naive esti¬ 
mated school effects as a function of school averages of standardized grade 5 math 
scores from the prior cohort of students. 95% bootstrap confidence intervals for 
the differences given by gray bars. 


naive estimated school effects as a function of school averages of grade 5 math scores from 
the prior cohort of students. An advantage of using the bootstrap over alternative standard 
error computations is that it automatically provides uncertainty estimates for complicated 
functions of the data and estimated parameters; in this case we present 95% bootstrap con¬ 
fidence intervals for the differences between the SIMEX and naive estimated school effects. 
The figure demonstrates the predicted pattern of SIMEX tending to increase the estimates 
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for schools with low average prior achievement and decrease the estimates for schools with 
high average prior achievement. Many of these changes are statistically significant. The ad¬ 
justed R 2 of a quadratic regression of the differences on the average prior math scores is 0.66 
with a 95% bootstrap confidence interval of (0.51, 0.81), clearly indicating a relationship 
between general achievement attributes of schools’ students and the corresponding changes 
in the estimated effects due to ME correction. This suggests that the SIMEX estimates are 
indeed less biased than the naive estimates. 

Although the adjustments due to SIMEX are sensible, they are of relatively modest 
magnitude in this example. The standard deviation among the SIMEX estimated school 
effects is 0.182 standard deviation units of student achievement, and the standard deviation 
of the changes is 0.021, so that the magnitude of the changes is about 12% as large as the 
magnitude of the effects. The modest change in this context makes sense given that the 
prior tests are fairly reliable and there are two of them, helping to reduce bias in the naive 
estimators caused by ME (Lockwood and McCaffrey, 2014). However, given that a primary 
concern about school or teacher accountability measures is that they will contain systematic 
errors correlated with student background characteristics such as prior achievement (Braun, 
2005), there may be value in removing this bias even if its magnitude is only modest. 

7. Discussion 

Estimation of a population mean from a nonrandom sample with error-prone covariates 
can be achieved through a straightforward application of the SIMEX method. However, it 
is distinct from common applications of SIMEX, which typically estimate the parameters 
of parametric models. Estimating a population mean from a nonrandom sample requires 
estimating parameters as well as evaluating functions of those parameters at various values 
of the data. An advantage of SIMEX is that no special adaptations of the method are 
necessary for this novel application in which alternative approaches to dealing with error- 
prone covariates are generally challenging. 

Our theoretical development addressed the case where the ME is homoskedastic with 
variance a 2 . However, heteroskedastic measurement error commonly arises in applications, 
including psychometrics (Battauz and Bellio, 2011), economics (Sullivan, 2001) and bio¬ 
statistics (see Delaigle <fc Meister, 2007 and references therein). SIMEX extends straight¬ 
forwardly to the case where the ME variance is heteroskedastic with known variances a 2 for 
each unit by using a 2 to generate the synthetic data for each unit. However, it is common in 
applications for a 2 to be unknown, such as the example with IRT-based achievement scores 
in Section 6. Appropriate SIMEX methods for the case of unknown a 2 when replicate mea¬ 
sures Wij are available for each Xi are provided by Carroll and Wang (2008), Devanarayan 
and Stefanski (2002) and Wang et al. (2009), and could be applied in the current setting. 
As noted in Section 6, applying SIMEX in the heterskedastic case where the ME variance 
is a known function of A* is addressed by Lockwood and McCaffrey (2015b). 

The need to know or make assumptions about the ME distribution is a fundamental 
limitation of correcting regression, weighting or DR estimators of a population mean for 
covariate measurement error. Outside of limited special cases, both the methods reviewed 
in Section 3 and SIMEX require such information. How this information is obtained will 
depend on the setting. In some cases validation data containing X itself are available, and in 
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other cases there may be replicate error-prone measures from the same unit that can be used 
to assess the ME distribution (Carroll et al., 2006). As noted in Section 6, in educational 
and psychometric applications involving trait estimates from IRT or factor analytic models, 
information about ME distribution is commonly available from the analysis methods used 
to create the observed scores. The validity of ME-corrected estimators such as SIMEX 
hinges on the ME assumptions being reasonable. With SIMEX and other methods, using a 
value of a 2 that is too small will tend to undercorrect the naive estimator for ME, and using 
a value that is too large will tend to overcorrect. A useful feature of SIMEX is that the 
projection process provides insight into how much, and in what direction, ME matters for 
the target estimand. If a 2 is unknown, the SIMEX procedure can be applied with different 
plausible values in a sensitivity analysis. As in Section 6, the bootstrap could be used in 
such settings to create a confidence interval for the difference between the uncorrected and 
corrected estimates. 

Another difficulty with any method in this context is the correct specification of either 
ir(X, Z ) or m(X, Z ) when there is ME. This is because ME obscures relationships (Carroll 
et al., 2006). In particular, common methods for building the propensity score model, such 
as adding terms until the R = 0 and R = 1 covariate distributions are balanced, do not work 
because balance of W(\) does not guarantee balance of X. Due to these challenges to model 
building and the increased likelihood of model misspecification, DR may be comparatively 
more attractive in the presence of ME than in the standard case. DR is also appealing 
over IPW when there is ME because ME can reduce the efficiency of the estimator relative 
to estimation with error-free covariates. The regression component of the DR estimator 
can improve efficiency and potentially offset the extra uncertainty created by ME. SIMEX 
is particularly valuable for DR estimation because implementing DR using the methods of 
McCaffrey and Lockwood (2014), which were detailed in Section 3, is challenging in practice, 
whereas SIMEX is straightforward. 

A key limitation specific to SIMEX is the projection to A = — 1. Several analyst decisions 
are required to conduct this projection, including the choice of the number B of synthetic 
data replications, the grid of A values, and the functional form of the projection function. 
Each of these choices can affect the final estimator in terms of bias, variance, or both. The 
choice of the projection function is most difficult. For example, as demonstrated in Figure 3, 
different extrapolation functions may fit the generated data essentially equivalently but still 
lead to different projections. If the estimates are relatively insensitive to the values of A, 
projection may risk introducing variance without sufficient bias reduction to compensate. 
Work needs to be done on using data from the sample of estimates across values of A 
to decide whether or not to project, and what projection function to use. Alternatively, 
combinations of the naive and SIMEX estimators might yield more accurate estimates, and 
data from the sample of estimates across values of A may be used to determine how to weight 
the two components. Whether or not such methods could improve on SIMEX remains an 
area for future research, but such methods may be particularly valuable in the estimation of 
means with nonrandom samples because such estimates can be noisy and because SIMEX 
may be the most applicable method for handling ME in such problems. 

Future work might also consider alternative methods for weight estimation such as 
boosted models (McCaffrey et al., 2004b), covariate balancing propensity scores (Irnai and 
Ratkovic, 2014), the super learner (Pirracchio et al., 2015), or minimum discriminant infor- 
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mation adjustment (Haberman, 1984; Hainmueller, 2012). These approaches tend to yield 
weights which better reduce differences among groups or yield more precise estimates of the 
mean than traditional weights based on logistic regression. SIMEX is likely to be the most 
practically feasible solution for many of these approaches in applications with error-prone 
covariates, and determining if their benefits extend to such applications is an area for future 
research. Similarly, matching is also sometimes used for estimating means in the context of 
causal modeling. Lockwood and McCaffrey (2015a) show that matching in the presence of 
error-prone variables can be quite challenging. SIMEX would be straightforward to apply 
to matching, and the close relationship between matching and weighting suggests that the 
resulting estimators may perform well, but additional study is needed. 
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Appendix 

Data Generation Code for Simulation Examples 

datagen <- function(n=1000, reliability=0.85, seed){ 
set.seed(seed) 

## Generate X using mixture of normals and set E[X]=0 and Var[X]=l 

mixprops <- c(.275, .475, .0666, .0667, 0.0667, 0.05) 

stopifnot(sum(mixprops)==1) 

mus <- c(0, -2, 2.25, 3.25, 4.25, -6) 

vs <- c(l, 1, .25, .25, .25, .25) 

xmean <- sum(mixprops*mus) 

xvar <- sum(mixprops*vs) + sum(mixprops*(mus-xmean)*2) 

mix <- sampled:length(mixprops), size=n, replace=TRUE, prob=mixprops) 
d <- data.frame(x = rnorm(n, mean=mus [mix], sd=sqrt(vs[mix]))) 
d$x <- (d$x - xmean)/sqrt(xvar) 

## Generate Z1 and Z2 

d$zl <- .3*d$x + sqrt(1-.3*2)*rnorm(n) 
d$z2 <- as.integer(runif(n) < 0.5) 

## Generate error-prone measure W 
sigsq <- (l-reliability)/reliability 
d$w <- d$x + rnorm(n, sd=sqrt(sigsq)) 

## Generate R 

d$p <- pt(0.5 + 1.2*d$x + 0.5*d$zl - 1.0*d$z2 + 0.7*d$x*d$z2, df = 1) 
d$R <- as.integer(runif(n) < d$p) 

## Generate Y (linear outcome) subject to E[Y]=0, Var[Y]=l, and covariates 
## having R~2=0.80. 

## "vary" is true population variance of explained part of Y 

## "vare" is selected to give R*2 = 0.80 

d$y <- -0.20 + 1.0*d$x + 0.6*d$zl + 0.4*d$z2 

vary <- 1.0*2 + 0.6*2 + 2*(1.0)*(0.6)*0.3 + (0.4*2)*(1/4) 

vare <- 0.25*vary 

d$y <- d$y + rnorm(n, mean=0, sd=sqrt(vare)) 
d$y <- (l/sqrt(vary + vare)) * d$y 

## Generate Y* (tobit outcome conditional on covariates) by censoring a linear 
## transformation of Y where the parameters of the linear transformation are 
## chosen so that E[Y*] and Var[Y*] are both approximately 1 
d$ystar <- (0.812 + 1.255*d$y) * as.numeric(0.812 + 1.255*d$y > 0) 
d$ystar0 <- as.integer(d$ystar==0) 
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## print summaries 
cat("\nMarginal\n") 

print(t(sapply(d, function(x){ c(mean = mean(x), sd = sd(x))}))) 
cat("\nR=0 vs R=1 means\n") 

print(t(sapply(d, function(x)-[ tapply(x, d$R, mean) }))) 
return(d) 

> 


Technical Conditions for Section 4 

Let Q = (y, R, Z ) to simplify notation. Here we establish sufficient conditions for the 
solution of E[ip(Q, VF( 1+ a), 9)\ = 0 to converge to the solution of E[ip(Q, X, 9)] = 0. Assume: 

1. 9 takes values in a compact set © C TZ h+1 

2. ip(Q, x, 9) is continuous in x for every 9 G 0 

3. E[ip(Q,W( i + A),#)] and E[ip(Q, A, 9)] exist for every 9 £ 0, and E[ip(Q 1 Wi + \,9)\ is 
continuous in 9 

4. For each A, #(i+a) is the unique element of 0 satisfying E[i/j(Q, kF( 1+ ^), 9)\ = 0, and 

is the unique element of 0 satisfying E[ip(Q, X, 9)] = 0 

5. limA^-i E[tp(Q, PFi+A) Q)\ exists for every 9 £ 0 

6. The following interchange of limit and integration holds for each 0£0: 

lim E[^(Q,W {1+X) ,9)} = E[ lim i/j(Q,W (1+x) ,9)} 

A—► — 1 A—►—1 

and this convergence is uniform in 9. 

Under these conditions limA^_i $<i+a) exists and is equal to #*• 

Proof: As A —■> —1, (Q,W( i+a)) —> (Q,A). Assumption 2 and the continuous map¬ 
ping theorem give that ip(Q, IU( 1+ a)> 6) i p(Q,X,9) for every 9 S 0. By Assump¬ 
tions 3, 5 and 6, the functions /i( 1+ A)(0) = E[iJ>(Q,W(i + \),8)] are continuous in 9 and 
converge uniformly to h^(9) = E[iJ>(Q, X,9)\ as A —> —1. Assumption 1 guarantees that 
{#(i_i_A)} has at least one limit point 9 G 0 and there is a subsequence {0( 1 + a*)} that 
converges to 9. Assumption 6 gives that for any e > 0, for A* sufficiently close to — 1, 
\h(i+x*)(6) ~ %>)(#)! < e for all 9. In particular then |/i ( i +a »)(0(i + a*)) - fyo)(0(i+A*))l = 
IV)( 0 d+A*))l < e so that limA*->-i fyo)(0(i+A*)) = 0- The uniform convergence also gives 
that h(ty(9) is continuous and so h^(9) = 0. By Assumption 4, 0* is the unique 0 of /i( 0 ) 
and so 9 = 9* must be the unique limit of {0( 1+/)v )}. 
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Assumption 1 is common in developing asymptotics for parametric models (e.g. Wald, 
1949). The remaining assumptions are rather benign other than Assumption 6. The inter¬ 
change may be reasonable given that standard estimating equation asymptotics require 1/5 
to have a second moment at least at the true value of the parameter (Stefanski and Boos, 
2002). The uniform convergence would generally be very strict but might be less so given 
Assumption 1. In any case, pointwise convergence of functions is not sufficient to guarantee 
convergence of zeros. 


Details for Linear Regression Estimator in Section 4 

Here we provide the details on the solution to E[ip(Y, R,W±+ X , 9)\ = 0 for the linear re¬ 
gression example of Section 4, where E[ip(Y, R, W\ + \, 6)} is given in Equation 5. Solv¬ 
ing (E[Y \ R = 1] - S 0 - 5\E\Wi + \ | R = 1 })p(R = 1) = 0 yields <5 0 ,i +a = E\Y \ 
R = 1] - 5 1)1+X E[W 1+X | R = 1]. Also, E[Y \ R = 1] = <5* 0 + 6^E[X \ R = 1] and 
E[Wi + \ | R = 1] = E[X j R = 1], since U and U\ are independent of R. This gives 
<5o,i+a = <5*o + <5*i E[X | R = 1] - 5 l)l+x E[X \ R = 1]. 

We can now use (E[W 1+X Y \ R = 1] — 5 0 E[W 1+X J R = 1] - 8\E[W 2 +X | R = 1 ])p(R = 
1) = 0 to solve for (5i,i+a- Plugging in <5o,i+a for So gives 

0 = E[W 1+X Y | R = 1] - S 0 ,i +X E[W 1+X | R = 1] - ^.i+a^i+a i R=1 } 

= E[W 1+X Y \R=1)~ ( E[Y | R = 1] - S ltl+X E[W \ R = l])E[W 1+x \ R = 1] - 
«5i, 1+ a£[W 1 2 +a | R = 1] 

= E[W 1+X Y | R = 1] - E[Y | R = 1}E[W 1+X \ R = 1] — 

«5i, 1+ a(E[W 1 2 +a \R = 1]~ E[W 1+x | R = l] 2 ) 

= S* 0 E[W 1+X \R=1} + S^E[W 1+x X \R=1]~ <5* 0 E[W 1+A | R = 1] - 

6*\E[X | R = 1}E[W 1+X | R = 1] - <5 1i1+ a(£[W 2 +a | R = 1] - E[W 1+X \ R = l] 2 ). 


This yields 


<5*i(E[W 1+a X \R = 1\-E[X\R = 1\E[W 1+X 1 R = 1]) 
E[Wl +x | R = 1] - E[W 1+X \R = l} 2 


We also have E[W\ +x X \ R = 1] = E[X 2 \ R = 1], so the numerator equals E[X 2 \ R = 
1] - E[X | R = l] 2 . Furthermore, E[W 2 +X \ R = 1] = E[X 2 \ R = 1] + E[{U + U x ) 2 \ R = 
1] = E[X 2 | R = 1] + (1 + A)cr 2 . This gives, 

<5*i(£[A: 2 | R = 1] - E[X | R = l] 2 ) 

1,1+A “ E[X 2 | R = 1] - E[X | R = i]2 + (l + A)u 2 ‘ 

We let v 2 = E[X 2 j R = 1] — E[X \ R = l] 2 and turn to /x — <5q — 5\E\Wi +x } = 0 to obtain 


/M+A — <5*o + <5*i (1 — 


v 2 + (1 + A)cr 2 


)E[X \R = 1] 


<5*i^ 2 


v 2 T (1 T A)<t 2 


E[X\, 


where the last term holds because E\W\ +X \ = E[X\. 
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Demonstration of Consistency of SIMEX 

This section demonstrates extrapolation functions for other outcomes and selection models, 
analogous to Figure 2 in Section 4.4. 





Figure 8: Extrapolation functions for simulated example analogous to Figure 2, but for the 
linear outcome. 
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Figure 9: Extrapolation functions for simulated example analogous to Figure 2, but for the 
logistic propensity score function. 
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Figure 10: Extrapolation functions for simulated example analogous to Figure 2, but for 
the linear outcome and the logistic propensity score function. 
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Performance of Estimators Under Alternative Scenarios 

This section presents summaries of estimator performance for alternative outcomes and 
covariate reliability, analogous to Figure 4 in Section 5. 
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Figure 11: Summary of simulation results analogous to Figure 4, but for tobit outcome. 

Lines: magenta - “ideal”; red - naive; blue - SIMEX(2); green - SIMEX(4). 
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Figure 12: Summary of simulation results analogous to Figure 4, but with reliability of 

equal to 0.70. Lines: magenta - “ideal”; red - naive; blue - SIMEX(2); green - 
SIMEX (4). 
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Figure 13: Summary of simulation results analogous to Figure 4, but with tobit outcome 
and with reliability of Wm equal to 0.70. Lines: magenta - “ideal”; red - naive; 
blue - SIMEX(2); green - SIMEX(4). 
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Details for Sandwich Standard Error Computations 

In this section we use standard results for estimating equations to develop variance estima¬ 
tors for each of the population mean estimators. These values can then be inserted into the 
projection formula in Equation 6 to obtain standard error estimators for the SIMEX estima¬ 
tors. As noted in Section 3, 0 n solves n^ 1 V’fXo %%■> X%, 0) = 0, for the appropriate 
vector of functions ip. For an estimator 0 n that solves n^ 1 Zi, -W, 0) = 0, 

the asymptotic variance can be estimated consistently by n _1 A _1 B (A -1 ) , where A = 
n~ 1 T,l=i-'l’(Yi,Ri,Zi,X i X), with i'p(Y, R,Z, X, 9) = dip(Y, R, Z, X, 0)/36', and B = 
n~ l YH=i Zi,Xi, 6 n ) , ip(Yi, R4, Zi,Xi, 0 n )'. With error-prone measurements, includ¬ 

ing the measurements with additional error used during SIMEX, the same formulas can be 
used to estimate the asymptotic variance V(9^ + x e ),b,n) °f #(i+a e ),b,n by replacing 6 n with 
0(i+\ e ),b,n and Aj with VF* ]+a ^ ): h in the expressions for A and B. 

We now derive ip and ip for each of the estimators. For these derivations we let we 
D = (1 ,Z',X'y. A similar development is provided by Hirano and Imbens (2001), and 
a summary of general results on estimating standard errors of parameters estimated via 
estimating equations is given by Stefanski and Boos (2002). 

Regression 

As shown in the article, for the regression estimator, with a linear model for the mean, 
the estimating functions for 6 = (/q 6')' are 

^(Y,R,D,0 )= ^ R - Y m D , s)D y 

This yields 



Inverse Probability-of-Response Weighting 

For IPW, we assume that the propensity scores are modeled by a generalized linear 
model with a Cauchy link. If G(u) is the CDF of the Cauchy distribution, then G(u) = 
7r _1 arctan(u) + 0.5, G(u) = 7r~ 1 (l + u 2 ) -1 . The estimating functions for 6 = (/r, a')' are 


iP(Y,R,D,d) 


R(Y - yL)[G(D'a) -1 } 

(R - G(D'a))G(D'a) [G(D'a)( 1 - G{D'a))}~ 1 D 


This yields 


iP(Y,R,D,9) 


-RGiD'a)- 1 -R{Y - p)G{D'a)[G(D'a)- 2 ]D' 
0 Q(R,D,a)DD' 


where 

Q(R,D,a ) 



1 

G(D'a) 2 


+ 


2n (D'a) 
G(D'a) 


+ (i - R) 


l 

(l-G(D'a )) 2 


2'k{D'o) 

1 - G(D'a) 


G(D'a) 2 . 
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Doubly Robust 


For the DR estimators, we again assume that the propensity scores are modeled by a 
generalized linear model with a Cauchy link, and that a linear model is used for the mean. 
We also allow for the covariates (or functions of the covariates, such as interactions) used to 
model the mean to differ from those used to model the propensity score. We let D\ denote 
the vector of variables used in modeling the mean, including 1 for the intercept, and we let 
D 2 denote the vector of variables used in modeling the propensity score, again including 1 
for the intercept. The estimating functions for 6 = (/z, S', a')' are then given by 


i/j(Y, R, Di, D 2 , 0) 


( R(Y - D'^G^a)- 1 } - r( M - D[5) \ 

R(Y - D\8)D\ 

(R - G{D' 2 a))G{D' 2 a) {G{D' 2 a){ 1 - G(D' 2 a))]~ l D 2 
\ R[G(D 2 a)~ 1 } - t 


This yields 


Tj>(Y,R,D u D 2 ,e) 


( -r (r - R[G(D' 2 a)~ 1 ])D , 1 
0 —RDiD[ 

0 0 

^ 0 0 


—R(Y - D[S)G(D' 2 a)[G(D' 2 a)- 2 ]D 2 D\8 - y 

0 0 

Q(R,D 2 ,a)D 2 D' 2 0 

-RG{D' 2 a)[G{D' 2 a)- 2 }D' 2 -1 
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Performance of Standard Error Estimators Under Alternative Scenarios 

This section presents summaries of performance of standard error estimators for alternative 
outcomes, sample sizes and covariate reliability, analogous to Figure 5 in Section 5. Note 
that for the tobit outcome, in Figures 17 to 20, standard errors using the projection method 
are not presented because we did not compute the analytical sandwich standard errors for 
the tobit outcome mean estimators. 


g Ideal 
$ Naive 

CD 

d) S(2) 

CD 

S(4) 

Ideal 


^ Naive 
- S(2) 
S(4) 
_ Ideal 
of Naive 

8 S(2) 

00 

S(4) 

_ Ideal 

C\J 

of Naive 
8 S(2) 

DO 

S(4) 
_ Ideal 

CO 

of Naive 

8 S(2) 
oo 

S(4) 



Standard Error 


Figure 14: Summary of estimated standard errors for different estimators of the mean of the 
linear outcome, with n = 500 and reliability of W (d equal to 0.70. Analogous 
to Figure 5. 
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Figure 15: Summary of estimated standard errors for different estimators of the mean of the 
linear outcome, with n = 1000 and reliability of equal to 0.85. Analogous 
to Figure 5. 
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Figure 16: Summary of estimated standard errors for different estimators of the mean of the 
linear outcome, with n = 1000 and reliability of equal to 0.70. Analogous 
to Figure 5. 
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Figure 17: Summary of estimated standard errors for different estimators of the mean of 
the tobit outcome, with n = 500 and reliability of W (\) equal to 0.85. Analogous 
to Figure 5. 
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Figure 18: Summary of estimated standard errors for different estimators of the mean of 
the tobit outcome, with n = 500 and reliability of W (\) equal to 0.70. Analogous 
to Figure 5. 
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Figure 19: Summary of estimated standard errors for different estimators of the mean of the 
tobit outcome, with n = 1000 and reliability of equal to 0.85. Analogous 
to Figure 5. 
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Figure 20: Summary of estimated standard errors for different estimators of the mean of the 
tobit outcome, with n = 1000 and reliability of equal to 0.70. Analogous 
to Figure 5. 
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